Application of LDA and word2vec to detect English off-topic composition

This paper presents an off-topic detection algorithm combining LDA and word2vec aiming at the problem of the lack of accurate and efficient off-topic detection algorithms in the English composition-assisted review system. The algorithm uses the LDA model to model the document and train the document through the word2vec, and uses the semantic relationship between the document’s topics and words to calculate the probability weighted sum for each topic and its feature words in the document, and finally selects the off-topic composition by setting a reasonable threshold. Different F values are obtained by changing the number of topics in the document, and the best number of topics is determined. Experimental results show that the proposed method is more effective than vector space model, can detect more off-topic compositions, and the accuracy is higher, the F value is more than 88%, which realizes the intelligent processing of off-topic detection of composition, and can be effectively applied in English composition teaching.


I. Introduction
Composition is an important means to express emotion and transmit information, while the theme is the soul of composition. The most important thing in a composition is that the subject is clear, otherwise it is easy to cause confusion and misunderstanding, or even off-topic.
Off-topic detection is used to judge whether a composition is off-topic. Its core content is to calculate the similarity between texts [1]. At present, the most commonly used and classic text representation model is the vector space model, and the TF-IDF algorithm based on the vector space model is the most widely used method to calculate the text similarity. This method manifests the weight of the word by the frequency of the word appearing in the document and the frequency of the word appearing in the document collection. The similarity of the texts is calculated by calculating the cosine value between the vectors. Although the word bag model method is simple and effective, this method ignores the semantic information of the word in the document and does not take into account the semantic similarity between words. The English words "like" and "love", for example, they all mean like, but in the vector space model, they are treated as two separate lexical items. For this disadvantage, some researchers have proposed methods of word extension, such as using dictionaries Word Net, How-Net for word extension. Chen et al. [2] proposed a method to calculate semantic a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 similarity of English vocabulary based on Word Net word extension, and Wang et al [3] proposed a method to calculate semantic similarity of vocabulary based on How Net. These methods rely on artificially constructed dictionaries, and may encounter many problems when new words appear.
In this paper, a new method of text similarity calculation is proposed for the deficiency of the above methods, and it is used to test the off-topic of English composition. The algorithm models the document collection by the topic model LDA, obtains the topic and topic feature words of each document and their probability distribution, and combines with the semantic relationship between words and words obtained by word2vec training, calculates the probability weighted sum of each topic of the document, and determines whether the composition deviates from the topic.
The study aims to seek the answer to the following research questions: 1. How to use LDA and word2vec to detect the off-topic English composition?
2. Compared with vector space model-based method, how is the effectiveness of the off-topic detection method based on LDA and word2vec?

A. LDA model
A LDA (latent Dirichlet allocation) model is a three-layer Bayesian generative model of "texttheme-word" proposed by David M, Andrew Y and Michael I in 2003 [4]. It is a three-tier Bayesian probability model extended on probabilistic latent semantic analysis (pLSA). The model contains three-tier structure of words, topics and documents. This model is an unsupervised machine learning algorithm that can be used to identify potential topic information in large-scale document collections or corpora. It adopts the word bag model, which treats each document as a word frequency vector, which converts text information into digital information that is easy to model and calculate. The model is based on the premise that the document is composed of several implicit topics, which are composed of several specific words in the text, ignoring the syntactic structure in the document and the sequence of words [5].
LDA topic model can be represented by a probability graph model in the form shown in Fig 1. LDA model is determined by hyper-parameter α and β, where α represents the relative strength between implicit topics in the document collection, while β reflects the probability distribution of all implicit topics themselves. In Fig 1, M represents the number of documents in the document collection, K represents the number of topics in the document collection, N represents the number of feature words contained in each document, θ m denotes the probability distribution of all topics in the m th document, andφ k denotes the probability distribution of feature words under a particular topic.

B. Gibbs sampling
The estimation of model parameters is needed in the process of constructing LDA model. the commonly used estimation methods are mainly variational Bayesian inference, expectation propagation algorithm and collapsed Gibbs sampling. The parameter inference method based on Gibbs sampling is easy to understand and simple to implement, and can extract topics from large-scale text sets very effectively [6]. Therefore Gibbs sampling algorithm has become the most popular LDA model extraction algorithm.
Gibbs sampling is a simple and widely used MCMC (Markov chain Monte Carlo) algorithms. Arora et al. [7] proposed to apply Gibbs sampling method to parameter estimation of LDA model. Feature word probability distribution under each topic and topic probability distribution of each document are the two most important parameters in the LDA model.
The specific steps of Gibbs sampling algorithm are as follows (the algorithm specific derivation process can be detailed in the literature of Farrahi and Gaticaperez (2011) [8]: a) Initialization. The topic z i is initialized to a random integer 1~T, i cycles from 1 to N, N is the number of all specific words that appear in the text in the corpus, which is the initial state of the Markov chain. b) Cyclic sampling. After enough iterations, the Markov chain approaches the target distribution, the topic z i at this time can estimate the φ and θ values according to the following formula. Among them: n ðtÞ k represents the number of times the t th feature word appears in the k th theme; n ðkÞ m represents the number of times the k th theme appears in the m th document. The φ and θ values are obtained indirectly by Gibbs sampling and recorded as posterior probability P (z i = k|z -i ,w). The formula is Among them: because z i represents the subject variable corresponding to the i th word; ┐ i means that the i th word is not included, so z ┐ i represents the probability distribution of all topics z k (k6 ¼i); z ðtÞ k;┐i indicates that the feature word t belongs to the word frequency of topic k; z ðkÞ m;┐i represents the size of the feature word set assigned to the topic k by the document m.

C. LDA modeling process
In this paper, before LDA modeling, for a given document collection D = {d 1 , d 2 ,. . ., d M }, each document d m (d m 2D) needs to be preprocessed, which mainly includes word segmentation, de-deactivate words, de-punctuation and other operations. Each word item after processing is saved and separated with spaces, and the corresponding corpus set is obtained after sorting as the next processing data. The processed corpus is presented as a document, and the document-word matrix is constructed. The final text representation is shown in formula (4).
Among them: M represents the total number of documents; m represents the document serial number; w mn represents the n th word item in the m th document.
For each document in the corpus, LDA gives the following generation process: a. Sampling from the Dirichlet distribution α to generate topic distribution θ m of the m th document; b. Sampling from the topic's polynomial distribution θ m to generate the topic z m,n of the n th word of the m th document.
c. Sampling from the Dirichlet distribution β to generate the word distribution φz m,n corresponding to the subject z m,n .
d. Sampling from the polynomial distribution of words φz m,n to finally generate the word w m,n .
Because the LDA model theory thinks that an article has multiple topics, and each topic corresponds to different words. The construction process of an article is to choose a topic with a certain probability, and then to choose a word with a certain probability under this topic, so that the first word of this article is generated. Repeatedly, the entire article is generated. Of course, it is assumed that there is no order between words.
The parameter estimation in this paper uses the Gibbs sampling algorithm [9] of MCMC method [10], which can be regarded as the inverse process of the document generation process, that is, in the case of a known document collection (the result of document generation), the parameter value is obtained by parameter estimation. According to the model diagram in Fig 1, the probability distribution of a document can be obtained as follows: Gibbs sampling algorithm can be based on corpus training LDA model. The training process is to obtain samples of subject and feature words in document collection by Gibbs sampling. The final sample obtained after the convergence of the algorithm can estimate the parameters of the model.
Through the above steps and analysis, according to the needs of the experiment in this paper, the document-word matrix is obtained by Eq (4), the preprocessed document collection D is modeled using LDA models, thereby we obtain the theme t i and its topic probability distribution P (t i | d m ), where t i 2T, T = {t 1 , t 2 ,. . ., t K }, and get the characteristic word w n of topic t i and its probability distribution P (w n | t i ), where w n 2W, W = {w 1 , w 2 ,. . ., w N }.

III. Topic correlation calculation based on LDA and word2vec
The representation of the document by the LDA model is to extract the topic and the feature words corresponding to the topic in the form of probability, there is some uncertainty. In order to express the semantic information of the word items in the document more accurately, this paper introduces word2vec method to better express the semantic information between words. By this method, the similarity between the word items is calculated with the feature words of the topic after LDA modeling, and finally the topic relevance degree is obtained.

A. Word2vec
In recent years, with the rapid development of deep learning, word vector based on neural network is more and more concerned by researchers. Tang et al. [11] proposed a word2vec language model for computing word vectors by using Bengio' NNLM model (neural network language model) and Hinton' Log_Linear model for reference. In 2013, Google released word2vec, an open source software tool for training word vectors [12]. word2vec model [13] can quickly and effectively express a word as a vector form of real value according to a given corpus through an optimized training model. It can simplify the processing of text content into a K dimension vector operation by using the context information of words. The similarity in vector space can be used to represent the similarity in text semantics. The word vectors output by word2vec can be used for many NLP-related tasks, such as emotional classification, finding synonyms, part-of-speech analysis and so on. Another feature of word2vec is efficiency. Mikolov et al. [14] point out that an optimized single-machine version can train hundreds of billions of words a day. It provides a new tool for applied research in the field of natural language processing.
The word2vec open source toolkit can be downloaded via the official website, where the three files of word2vec.c, demo-word.sh and distance.c are associated with training word vectors. The word2vec.c file contains the implementation of each model of word2vec, the demoword.sh contains a list of the parameters specified for the model training, and the distance.c file calculates the cosine values between different word vectors. Word vector training can be performed on word2vec through the demo-world.sh script. word2vec contains two training models, the architecture models used are CBOW (continuous bag-of-words) and skip-gram model respectively. The principle diagram is shown in Fig 2. It is obvious from Fig 2 that both CBOW and skip-gram model contain input layer, projection layer and output layer. Among them, the CBOW model predicts the word vector of the current word through the context, that is, the continuous words corresponding to the context of the current word are represented in the form of a word bag, and the trained target vector is selected as the summation of the context word vector. While the skip-gram model generates word vectors exactly the opposite of the CBOW model, which predicts its context only by the current word. By these two models, word2vec can consider the context information very comprehensively, so it can achieve better results.

B. Calculation of subject correlation
Before calculating the subject correlation degree of the document, we need to train the document collection through the word2vec to get the semantic information between the words.
For the English corpus of this paper, word2vec can identify different words according to the space between words. After word2vec training, the vector representation of each word can be obtained, and the cosine value of the two vectors is calculated to represent the semantic similarity distance of the two words. The larger the cosine value, the closer the semantics of the two words are, such as two n dimensional vectors a (x 11 , x 12 ,. . ., x 1n ) and b (x 21 , x 22 ,. . ., x 2n ). The formula for calculating cosine values is as follows: The word vector information obtained after training is stored in a file, which is convenient for subsequent steps to calculate the similarity of the word vectors.
Based on the information obtained above, the cosine similarity cos (w j ,w n ) of the word item w j and the feature word w n under the t i topic is calculated using word2vec. Then the correlation degree between the word item w j and the topic t i is the probability weighted sum S (w j , t i ) of the cosine similarity of each feature word under the word item w j and the topic t i , which can be expressed by the following formula. So we can get the correlation degree of word item w j and document d m , that is, probability weighted sum S (w j , d m ) of the relevance of each topic of w j and d m , which is expressed as: Finally add the S (w j , d m ) values of each word item of the document. The formula is as follows:

IV. Off-topic detection algorithm
First, the off-topic detection algorithm preprocesses the document collection, establishes the document-word matrix after the preprocess, then models the document collection by the LDA model, obtains the document topic and its distribution and the feature word under the topic and its distribution. Then use word2vec to train the document collection and save the results of the training, and then combine the LDA and word2vec information. Finally, each document is screened according to the threshold set in this article, so as to find out the off-topic documents.
The off-topic detection algorithm can not only obtain the topic information of the document by LDA, but also obtain more accurate semantic information contained in the word through the word vectors trained by word2vec. The specific steps of off-topic detection algorithm are as follows: c. LDA modeling. Model each document in the document-word matrix built by the above steps. From formulas (1)(2), the topic probability distribution θ m of the m th document and the value of the probability distribution of the feature words under the k th topic are obtained respectively. Sort according to the probability value from large to small order, so as to get the topic of each document and its probability distribution and feature words and their probability distribution. For example, 60% of the probability distribution of the theme of an English document is discussing education, and 40% is about children. Under the education theme, feature words such as "school", "students", and "education" will appear. Under the theme of children, the feature words are "children", "women" and "family" and so on.
d. Use word2vec to train word vectors. A preprocessed document collection is used as input, trained with word2vec, and output as word vectors corresponding to each word. Using the generated word vector, the distance (similarity) between the words is calculated and specified by formula (6). Such as specifying the word "woman", will show the word "man" which is closest to "woman" in the trained text and the cosine distance between them is 0.685. After training, the semantic information between the words in the document can be expressed, and it becomes vector information and saved.
e. The topic correlation degree of the document is calculated with LDA and word2vec. The cosine similarity of each word item to each feature word under the i th topic after LDA modeling is calculated by word2vec. Use formula (7) to calculate the probability weighted sum of each feature word, then calculate the probability weighted sum of each topic according to formula (8), and finally, according to formula (9), the topic correlation degree of each word item is weighted and the total correlation degree is determined, and the off-topic composition is selected according to the threshold.
The LDA model in the algorithm models the document collection and uses Gibbs for sampling to indirectly obtain the model parameters. Through parameter estimation, we can get the probability distributions of different topics and the probability distributions of the feature words of different topics. In order to more accurately represent the semantic information in the document, the algorithm adds word2vec to train word vectors. This method uses a lowdimensional space representation, which not only solves the problem of dimensional disaster, but also mines the related attributes between words, thereby improving the semantic accuracy of the text. In summary, the algorithm combines the respective advantages of LDA and word2vec. The result of word2vec training makes the semantic relationship between words in the document more accurately expressed, so that LDA model can effectively determine whether the topic of the document itself is relevant. The topic correlation degree of the document is obtained in the low-dimensional semantic space, and the off-topic document can be detected by correlation degree.

V. Experimental results and comparative analysis
This paper collected 1200 college English compositions under six different titles, each with 200 compositions. The students come from Zhejiang Technical Institute of Economics, and they are the students I teach. From their English class work I obtained the writing samples. I obtained permission from my institutional ethics committee. Each composition was graded manually. There are a certain number of off-topic compositions under each topic. The full score of the composition is 15. If the result of manual marking is less than 5, this paper thinks that the composition is off-topic. The off-topic composition detected by the experimental results is compared with the off-topic composition graded manually, and a comprehensive evaluation and analysis is carried out from the accuracy rate, the recall rate and the F value to verify the effectiveness and practicability of the algorithm in the experiment.
The accuracy rate refers to the ratio of the number of correctly detected off-topic relevant compositions to the total number of detected off-topic compositions. The accuracy rate is expressed by P. The recall rate refers to the ratio of the number of correctly detected off-topic relevant compositions to the number of all off-topic related compositions, and the recall rate is expressed by R. Suppose that T is used to represent the number of relevant off-topic compositions correctly detected by the system, A is used to represent the total number of off-topic compositions detected by the system, and the total number of off-topic related compositions is expressed by B, then the formula for calculating the accuracy and recall rate is as follows: It is known from the meaning of formulas (10)(11) that in general, the higher the accuracy rate, the lower the recall rate, and the higher the recall rate, the lower the accuracy rate. F value can reconcile the influence of their mutual restraint. It is a comprehensive index which takes into account the accuracy and recall rate. The formula is as follows: From formula (12), we can see that because F value comprehensively considers the results of accuracy and recall, when it is higher, it indicates that the algorithm is more ideal. In the experiment, the LDA model uses Gibbs sampling. In the process of modeling the document topic, first assume that the number of topics K = 2, in this experiment the hyperparameter α takes the empirical value, α = 50/K, which changes with the number of topics, the hyperparameter β also takes a fixed empirical value, β = 0.01. In order to ensure the accuracy of the experimental results, the number of Gibbs sampling iterations is set to 1000.
When using word2vec to train the document collection, because word2vec provides many hyperparameters to adjust the training process, choosing different parameters will affect the quality of the word vector generated by the training and the speed of training. We can know the different parameters and the meaning represented by each parameter in word2v training by consulting literature [15]. According to the requirements of this experiment, the parameter setting of training document collection with word2vec is shown in Table 1.
Suppose the number of topics K = 2, according to the algorithm designed in Fig 3, after modeling the document by LDA and combining it with word2vec, by comparing the off-topic documents obtained by selecting a certain threshold with the results of manual labeling, according to formulas (10)~(12) to get the corresponding off-topic detection accuracy, recall rate and F value, and finally calculate the average results of the six topics. The results are shown in Table 2.
From Table 2, we can see that when the number of topics is 2, the average accuracy rate is 83.83%, the average recall rate is 89.08%, and the average F value is 86.22%. To achieve the best results of off-topic detection, by changing the number of topics in the document in the experiment, the changing trend of the number of topics and F value is obtained, and then the optimal number of topics in LDA modeling is determined, and finally the final results of the experiment is gotten according to the optimal number of topics.
Since a document will have multiple topics, experiment changes value K which is the number of topics in the document, and the value of K is selected as 2, 3, 5, 10, 15, 20, 25, 30 in sequence. The experiment is carried out with different topic numbers, and the corresponding off-topic documents are obtained after selecting a certain threshold. According to the comparison with the previously manually marked scoring results, the accuracy rate, recall rate and F  Fig 3.  From Fig 3, it can be clearly seen that the average F value changes with the number of different topics, and it is found that when the number of topics is 15, the average F value reaches the highest. Therefore this paper can determine the optimal number of topics to be 15. At the same time, it is found that with the increase of the number of topics, the iteration time of the experiment will also increase.
It was found in the experiment that when value K which is the number of topics in the document is changed, the value of the hyperparameter α will also change accordingly. The value of K is inversely proportional to α. Obviously, the larger the value of K, the smaller the value of α, indicating that each document contains more topics. For the feature words under each topic in the experiment, it has been proved by Wu & Wang in 2015 [16] that when selecting five feature words, good results will be obtained, so in this experiment, we select five feature words for each topic of each document.
After conducting experiments based on the optimal number of topics determined in this article and comparing the off-topic compositions detected by the experimental results with the off-topic compositions with manual annotation and scoring, the average accuracy of off-topic detection under the six different titles was finally obtained. The average accuracy is 89.63%, the average recall is 88.03% and the average F is 88.15%.  A comparison experiment is also carried out with the TF-IDF algorithm based on vector space model. Vector space model is based on a critical assumption that the characteristics of the content expressed in an article and the order or location composing the article entries are irrelevant, but related to how often these semantic units appear in the article. Vector space model representation refers to looking for feature words, also called keywords, in a document. The keyword can represent the content of the document. Give keywords weight through a certain algorithm. The text is represented by feature terms and weights.  Fig 4. The specific process is: (1) the input of text and system parameters; (2) determining the candidate words, including filtering stop words, participles and recording the specific location of the candidate words in the document; (3) Word weights are calculated through the TF-IDF algorithm, the weights are sorted, and the first n words were extracted as keywords, that is, the set of subject terms. These term sets constitute an n-dimensional feature vector.
The TF-IDF calculation method is the most commonly used weight calculation method in vector space models, consisting of two parts: TF and IDF. TF is the frequency with which a given word appears in a document. DF refers to the frequency of keywords appearing in multiple documents, IDF is the reciprocal of DF, that is, keywords that appear in all N documents. For the k th feature item t ik in text d i , the corresponding weight calculation method is: a. TF stands for Term frequency. In practical applications, the TF value needs to be normalized to avoid statistical deviations caused by too long documents. If t i,k appears m i,k times in the text d i , then b. IDF stands for Inverse document frequency. The frequency of feature terms in the global text set D is: Assuming that there are N documents in the global text set, and the feature term t i,k appeared in the n i,k th text, then: In the comparative experiment, the same English composition document is used as the corpus. Firstly, the corpus is preprocessed, and then the document is expressed as a vector of vocabulary items using the TF-IDF algorithm. Secondly, the cosine similarity between the composition to be tested and the given five model compositions is calculated. Then according to the similarity result, the mean value is processed as the result of the composition, and finally, the off-topic compositions are selected according to the threshold. Similar to the evaluation method used in this experiment, F value was used as the evaluation index at the end of the experiment, and the average F value of the off-topic detection obtained by the six groups of experiments was 78.62%.
The comparison results of the F values of this method and TF-IDF algorithm based on vector space model are shown in Fig 5. According to the comparative analysis of the experimental results of the two methods, we can see from Fig 5 that the proposed algorithm in the paper is more effective, can accurately analyze the semantic information of word items in the composition, and can also get the distribution of topics in the composition. These factors are very helpful to detect whether the composition is off-topic. Under the condition of ensuring certain accuracy, compared with the TF-IDF algorithm of vector space model, this algorithm can detect more off-topic compositions, and F value is obviously improved, and the algorithm has reliability. In the comparative experiment, it was found that in the two groups of experiments, the method proposed in this paper found all the off-topic compositions under the topic, with high accuracy, but the TF-IDF algorithm based on the vector space model did not detect all the off-topic compositions under the topic. In the undetected compositions, it is found that there are compositions with the score of 0. Although the contents of the compositions are not blank, the subjects are off-topic. The algorithm proposed in this paper can detect such compositions well. This fact also reflects one of the biggest shortcomings of the TF-IDF algorithm based on the vector space model. It is only calculated by TF (word frequency) and IDF (inverse document frequency), and it cannot effectively judge the semantics of the words themselves in the document and has certain limitations.
The off-topic compositions detected by off-topic detection algorithm in this paper can reach more than 88%, and the accuracy rate is relatively high. At the same time, it is more effective than the TF-IDF algorithm under vector space model, and can efficiently screen out the off-topic compositions in a short time, which can save a lot of time for teachers.

VI. Conclusion
We use LDA to model the composition, which can easily extract the topic of the composition and its feature words, and train it with word2vec. The results after training can express the semantics between words more accurately. Then we use LDA and word2vec to calculate the topic correlation degree of the composition. The experimental results show that the off-topic composition is effectively detected by this algorithm. The algorithm proposed in this paper has the function of intelligent auxiliary to the marking of English compositions. The algorithm can process English compositions quickly, objectively, fairly, and automatically, and screen out the off-topic compositions under the corresponding topics, which reduces the influence of teachers' subjective factors and further improves the efficiency of marking compositions. It makes up for the shortcomings that the manual method can not quickly and effectively detect a large number of English compositions in a short time. Compared with the traditional vector space model, this method can not only obtain more semantic information between words, but also obtain the topic distribution of compositions by modeling compositions, which makes up for the deficiency of traditional vector space model method which does not consider the semantic information of words themselves.
In this paper, when using LDA modeling to determine the number of topics, only the F value is used as a reference, and no better calculation theory for determining the number of topics is considered. Considering that the LDA model is easy to expand, the next step will be prepared to study and improve the method of document modeling and topic number determination based on the LDA model.